Abstract

Peer review is a process usually conducted under conditions of extreme scarcity of resources:
very little money beyond minimal postage costs is available; the reviews are done by
volunteers with little time to spare; little clerical or technical support staff is available; and,
for professional meetings, the schedule requires that the process be completed in little time.
These constraints usually prevent the practical application of reviewer training and the use of
manual data entry and general purpose data-base or statistical programs to detect and off-set
the effects of differences in the stringency (i.e., standards) of reviewers. This paper describes
a computer based, performance rating information processing system, performance rating
theory and programs for the application of the theory to obtain ratings free from the effects
of reviewer stringency. In spite of the otherwise usual lack of resources, the prior existence of
these systems, originally developed for the assessment of the clinical performance of students
in health professions educational programs, provided the practical capability for controlling
reviewer stringency in the peer review process for an international research conference.
(Results of this application are given in a separate paper.) Improvements in the peer review
process can be obtained through the use of appropriate specialized data management and
analysis systems. As systems similar to those described here become more generally
available, we may expect a concomitant improvement in the reliability and validity of formal,
technical peer review processes.
Controlling Rater Stringency in Peer Review



Gerald J. Cason, Ph.D. and Carolyn L. Cason, Ph.D.
University of Arkansas for Medical Sciences
Little Rock, Arkansas, USA 72205



Where the peer review process involves a large number of reviewers and proposals or
manuscripts to be processed in a relatively brief period, for example a large international
scientific conference, the control of rater bias requires both practical, logistical capabilities
and a theoretical understanding of the rating process. The capabilities brought to bear in
processing the reviews of abstracts submitted from the Americas, Pacific Oceania, India and
Asia to Sigma Theta Tau (USA) for this conference were originally developed to address
needs and problems in the assessment of the clinical performance of students in health
professions training programs. The purpose of this paper is to briefly describe our
performance rating information processing system, our performance rating theory, and their
application capabilities. The prior existence of these capabilities permitted: (a) setting up the
data collection procedures, collecting the data, and processing the data on a very brief
schedule, but with no special staffing and almost no budget beyond postage costs; and, (b)
once the data were in machine (i.e., magnetic) form, obtaining measures of abstract quality
that were independent of the standards of the specific reviewers who happened to review
particular abstracts. Here our performance rating system and our rating theory are briefly
described. Their actual application to the reviews of abstracts for this conference is given in
another paper in this symposium (C. Cason, Cason & Redland, 1987).

Performance Rating System 

We developed the Performance Rating (PR) enhancement to the UAMS Objective Test 
Scoring (OTS) system to assist the clinical teacher to assess the clinical performance of 
students in much the same way that the OTS portion supports the classroom teacher. The
system has proven useful in a wide range of applications, including evaluating clinical
performance of nursing students, medical students and residents; the teaching performance
of faculty; and, in one previous study, processing reviews of proposals for meetings of an 
international scientific organization (Cason, Cason & Stritter, 1986a; 1986b). 

Because of the PR system's original purpose many of its capabilities are irrelevant to the
present case of peer review data. For example, the PR system provides record keeping
across multiple assessments (e.g., examinations and clinical performance rating occasions);
allows the use of multiple, different rating inventories within a course, each inventory being
tailored to the performance evaluation needs of specific clinical settings; and, easily allows
the integration of scores from many different assessment methods: essays, multiple-choice
questions, performance ratings. These capabilities have been previously described in detail
with extensive examples of specific clinical performance rating applications (Cason, Schoultz,
Cason, Glenn, Jones, Golden, Lang, & Doyle, 1986) and shall not be elaborated here. Only
those features of the PR system that are relevant to the present topic are addressed in detail.
In general, the capabilities of the PR system which make it useful for processing the peer
reviews of abstracts submitted to a conference such as this include: the ease and speed with
which a new inventory may be implemented; the low cost and speed of data collection,
processing and reporting arising from the use of optically scanned rating sheets; and, the
appropriateness of its reports.

The OTS-PR system's 70 modular FORTRAN programs run on a mainframe (Digital
Equipment Corporation VAX-8530) computer. But, to ease accessibility to the faculty, the
user interacts with a full-time Assessment Scoring Coordinator (ASC), who in turn interacts
with the OTS-PR system programs, scanner, and so forth. Annually, the ASC routinely
supports an average of about 100 courses and other applications of OTS-PR. This service is
available year round to faculty users without charge for internal projects and approved
external projects, such as the support of analysis of abstract reviews for this conference.
Thus, producing the abstract review rating sheets, processing them through the scanner, and
obtaining OTS-PR reports was accomplished without special staffing or funding (other than
postage costs) for the project. Further, these routine services provided by the ASC were
delivered on the normal delivery schedule for any user and thus were well within the time
requirements for this application. It normally takes the ASC one day to set up a new user's
area in the computer. For a rating inventory not previously used in the system, three days are
needed to deliver the scannable rating sheets (in total quantities of up to about 3000 sheets).
When completed rating sheets (i.e., containing rating data) are returned by the user to the
ASC, processing is completed and standard reports are normally available the following
working day. The only unusual requirements associated with the processing of the peer
review of research abstracts involved some programming changes to accommodate 400
subjects (i.e., abstracts) rather than the normal 200 maximum used in processing student
performance data.

Figure 1. Blank General Purpose Rating Sheet

For this application we provided the ASC with the rating inventory (i.e., list of rating
criteria and scale definitions) and a list of "names" of subjects to be rated (i.e., P1, P2, etc.,
for proposal 1, 2, etc.). This information was entered in the computer by the ASC. OTS-PR
programs used this information, subject identification numbers, blank rating forms
(illustrated in Figure 1), and the computer's line printer to produce as many copies of the
sheet per abstract as we needed for distribution to the reviewers.

Figure 2. Rating Sheet with Generic Clinical Inventory

Figure 2 provides an illustration of the rating sheet with a hypothetical clinical
performance rating inventory printed on it. Field tests with prototype rating data processing
programs (circa 1978-79) demonstrated that it is essential to keep to a bare minimum the
quantity of data manually entered on the rating sheet, especially that data entered by the
rater. Otherwise, errors increase and excessive time for recording information reduces the
acceptability and usefulness of the system. For this reason, the computer's line printer
overprints both identifying information and the inventory's text on the machine scannable sheets.
This includes "slugging" identification data in the scannable data grids. Using the line printer
to print both the inventory and subject identifying data on the rating sheet provides the user
maximum flexibility, ease of editing and revision, while also minimizing the quantity of
information that must be manually entered on the sheets.

As can be seen in Figures 1 and 2, the sheet provides room for up to 40 one-line criteria of
35 characters each. The system permits sub-scales and multi-line rating items. The rater
records his or her judgment by marking a circle: numbered 1 through 5 (with 5 always best),
or labeled "no opinion" or "not applicable." Space for written comments is provided on the
back of the sheet. The subject identification number printed on the rating sheet is not
confidential. A different, confidential one is used in reports and records in the OTS-PR
system.

For this application one of the options of the system was used and required unique rater
ID numbers to be manually entered on the sheets. This permitted the automatic generation
of reports on the performance of each abstract reviewer. Rater ID numbers were entered on
the sheet by a member of the Sigma Theta Tau central office clerical staff.

The reports received by a specific user are determined by that user's selection from a menu of
17 available reports (some of which are listed in Table 1). Examples of only the three most 
relevant to the current topic, peer review for an international conference, are illustrated here. 

Table 1: Partial List of OTS-PR Reports

For Current Assessment (Test or Rating)
    Assessment Instrument Analysis
        Analysis Summary
        Item Analysis
    Subject (e.g., student, abstract) Performance
        Department (detailed scores for archive)
        Subjects by Rank Order
        Histogram of Subject Scores
        Individual Subject Performance
    Individual Rater (e.g., reviewer) Performance

Across Multiple Assessments (Tests or Ratings)
    For Coordinator's Use
        Alpha-ordered, all subjects' scores to date
        Rank ordered, subjects' cumulative totals
    Subject's Individual Cumulative Scores
    Posting (subjects' names excluded)
        ID-ordered, all scores to date
        ID-ordered weighted totals

The Individual Performance Report (IPR) illustrated in Figure 3 provides information on
a single subject (e.g., student, abstract) regarding a single rating occasion. The IPR gives
item, subscale and total average rating in both graphic and tabular form. In the graphic part
of the report, the V profile makes it easy to rapidly determine, by visual inspection, this
subject's relative strengths and weaknesses. The V profile provides a comparison with
average ratings obtained by all members of the group on whom data were included when the
report was prepared. When unique rater ID numbers are used, the report on an individual
rater is similar in structure and appearance to the IPR. However, the individual rater report
gives the average rating that the rater assigned to subjects he or she rated compared with the
average of rater averages for each criterion.

The Students by Rank Order report illustrated in Figure 4 provides information on the
whole group of subjects rated on a particular occasion. It gives a listing of all subjects (e.g.,
abstracts) from highest-scoring to lowest-scoring with total score reported in several units of
measurement, e.g., percentage, rank, and Z-score (standard scores with mean = 500; s.d. =
100). Also given is the total number of raters (or rating sheets) that the total for each subject
was averaged across. As Figure 4 shows, the number of raters per subject need not be
constant. A similar Raters by Rank Order Report is also available. Raters are ranked by the
average total rating they each assigned to subjects.
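
The rank-order listing described above can be sketched in a few lines of Python. This is only an illustration, not the OTS-PR FORTRAN code: the function name, the input layout, and the use of the population standard deviation for the standard scores are assumptions made for the example.

```python
from statistics import mean, pstdev


def subjects_by_rank_order(totals_by_subject):
    """Build rows of (rank, subject, average, standard score, number of raters).

    totals_by_subject maps a subject ID to the list of total ratings it
    received; the number of raters per subject need not be constant.
    """
    averages = {s: mean(v) for s, v in totals_by_subject.items()}
    values = list(averages.values())
    group_mean = mean(values)
    group_sd = pstdev(values)  # assumption: population s.d. of subject averages
    rows = []
    for subject, avg in averages.items():
        # Standard score on the report's scale: mean 500, s.d. 100.
        if group_sd:
            standard = 500.0 + 100.0 * (avg - group_mean) / group_sd
        else:
            standard = 500.0
        rows.append((subject, avg, standard, len(totals_by_subject[subject])))
    # Highest-scoring subject first; ties keep their input order.
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank,) + row for rank, row in enumerate(rows, start=1)]


if __name__ == "__main__":
    for row in subjects_by_rank_order({"P1": [4, 5], "P2": [3], "P3": [2, 3]}):
        print(row)
```

The same function applied to rater averages instead of subject averages would give a listing analogous to the Raters by Rank Order Report.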

The Rating Analysis Summary Report is illustrated in Figure 5. It provides information
on the performance of the assessment inventory and procedure. Information is given at the
category (sub-scale) and total score level. If two or more raters rated each subject (student,
abstract), then meaningful inter-rater reliabilities are reported. (When only one rater rates
each subject a meaningless zero is reported.) Because the number of raters rating each
subject may vary, the reliability for the average (geometric mean) number of raters rating a
subject is reported. The information provided in the rating analysis summary report
can assist the user to improve both the reliability and validity of the rating inventory through
practical trial, judicious editing, and revision. At UAMS the senior author of this paper and
other members of the staff of the Office of Educational Development provide assistance to
faculty users of OTS-PR to help them make best use of the information given in the rating
analysis report.

Figure 3. Individual Performance Report

Figure 4. Students by Rank Order Report

Figure 5. Rating Analysis Summary Report

One of the uses of the rater reports is the determination of the presence of differences in
raters' standards, i.e., their stringency in evaluating the subjects of the assessment. If subjects
are randomly assigned to raters, and each rater rates a sufficient number of subjects, there
should be only a small variation in the observed mean rating given by each rater. Except in
those settings where extensive and repeated rater training occurs, ordinarily there are
practically important differences in the standards of different raters. While OTS-PR
currently provides reports from which this may be inferred, it provides no way to test this
inference, nor to off-set such differences should they exist. The development and application
of our performance rating theory addresses this problem.
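
As a rough illustration of the descriptive check discussed above, the following Python sketch computes each rater's mean assigned rating and the spread between the most lenient and most stringent rater. The function names and the triple-based input layout are hypothetical. A large spread only suggests a difference in standards; it is not a test of that inference and does not correct for it.

```python
from collections import defaultdict
from statistics import mean


def rater_mean_ratings(ratings):
    """Return {rater: mean total rating assigned by that rater}.

    ratings is an iterable of (rater_id, subject_id, total_rating) triples.
    """
    by_rater = defaultdict(list)
    for rater, _subject, value in ratings:
        by_rater[rater].append(value)
    return {rater: mean(values) for rater, values in by_rater.items()}


def stringency_spread(ratings):
    """Difference between the highest and lowest rater mean rating."""
    means = rater_mean_ratings(ratings)
    return max(means.values()) - min(means.values())


if __name__ == "__main__":
    data = [("R1", "P1", 4), ("R1", "P2", 5), ("R2", "P1", 2), ("R2", "P3", 3)]
    print(rater_mean_ratings(data), stringency_spread(data))
```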

General, Qualitative Rating Theory

Our performance rating theory (Cason & Cason, 1981; 1984; 1986) evolved in response to
our concern about the reliability and validity of ratings-based measures of complex human
performance (and products), originally the patient care activities of health care professions
students, especially where the usual methods of controlling systematic rater error (i.e., rater
training, improved inventories, more raters per student, all raters rating all students) were
frequently impractical. Essentially the same concerns arise for the same reasons in the
typical process used to evaluate scientific products: proposals, abstracts, manuscripts. Our
theory and its derivative simplified model of performance rating were developed to provide a
mathematical basis for controlling systematic rater error or bias arising from differences in the
standards and other characteristics of the raters (e.g., teachers, reviewers) who happen to
judge an individual subject (e.g., student, abstract).

Figure 6. General Rater Characteristic Curve (RCC)

Our performance rating theory (Cason & Cason, 1984) posits that the rating received by a
subject (e.g., student, abstract) is a function of the subject's true ability (i.e., competence,
quality) and the rater's (e.g., teacher's, reviewer's) characteristics. The rater's characteristics
include: (a) resolving power, i.e., the capacity to assign different ratings to different amounts
of subject ability, (b) sensitivity, i.e., the maximum value of the rater's resolving power, (c)
stringency, i.e., the general tendency to require more or less ability for a given rating when
other characteristics are equal across raters, and (d) effective rating floor and ceiling, i.e., the
minimum and maximum ratings a rater will actually assign in spite of the ostensible range,
e.g., that printed on the rating inventory or scale. It is these characteristics (and random
error) that determine the relationship between subject ability and assigned rating; this
relationship is illustrated by the rater characteristic curve (RCC) in Figure 6. The RCC
arises from the net, joint effects of the rater's knowledge, understanding, and beliefs about
the task to be performed, its difficulty, constraints imposed by the setting or situation, the
rating procedure (including the inventory used), and related factors.

Rater resolving power and the notion of a rater's pivotal rating standard are the two most
primitive concepts underlying the theory. Resolving power is reflected in the slope (i.e.,
steepness) of the rater characteristic curve (RCC) at a given point. The assumed nature of
the change in rater resolving power, as subject ability varies from very low (i.e., a great
distance below the rater's pivotal standard) to very high (i.e., a great distance above the
rater's pivotal standard), implies an s-shaped curve. Resolving power is not treated as a
formal parameter of the theory. The four formal rater parameters of the theory, implied by
the rater characteristics given above, are associated with mathematical parameters of the
RCC. The first, sensitivity, is measured by the slope of the RCC at the point which evaluates
to a rating half way between the rater's effective rating floor and ceiling. The projection of
this point on the stringency scale is the rater's pivotal standard or rater reference point
(RRP). The effective rating floor and ceiling, i.e., the asymptotic minimum and maximum
ratings that the rater will actually assign (and not necessarily 0% and 100%, respectively) may
be viewed as the third and fourth formal parameters of the theory. Thus, the rating obtained
by the subject is a function of the subject's ability point (SAP), i.e., the subject's true ability,
and the location and shape of the RCC as defined by its general equation and values of its
four parameters.

Figure 7. RCCs for Simplified Model: Raters of Stringencies K and L Give Subject of
Ability S Ratings RA and RB, Respectively.

In our empirical work we use a simplified model of our theory to ease the problems of
estimating the values of the formal subject and rater parameters. Our simplified model
(Cason & Cason, 1984) accounts for all systematic variation in performance ratings
exclusively by variation in one rater parameter (i.e., stringency) and one subject parameter
(i.e., ability or quality). As illustrated in Figure 7, this simplified model assumes that (a) all
raters have equal sensitivity (i.e., the slopes of the RCCs are equal) and (b) all raters have
effective rating floors and ceilings of 0% and 100%, respectively. The model is applicable
where there is sufficient overlap in who rates whom, i.e., where there is sufficient coupling of
the data. This coupling is frequently present in the structure of data found in health
professions clinical education settings, i.e., where each student is rated by several but not all
raters and each rater rates several but not all students. This structure is, of course, frequently
found in the reviews of scientific proposals and abstracts.

The necessary coupling between data points (i.e., ratings) may be understood by analogy
to acquaintanceship relationships. If a rating is assumed to be a metaphorical handshake and
this is taken to mean the rater and subject are acquainted, then the simplified model applies
to sets of data in which there is a path of mutual acquaintance leading from any subject or
rater to every other subject and rater.
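
The acquaintanceship analogy amounts to asking whether the graph whose nodes are raters and subjects, and whose edges are ratings, is connected. A minimal sketch of such a check is given below, using a breadth-first search. The function name and input layout are assumptions for illustration and are not part of the OTS-PR programs.

```python
from collections import defaultdict, deque


def is_coupled(ratings):
    """Return True if every rater and subject is linked by a chain of ratings.

    ratings is an iterable of (rater_id, subject_id) pairs, one per rating
    ("handshake"). Raters and subjects are kept in separate namespaces so
    that a rater ID may coincide with a subject ID.
    """
    graph = defaultdict(set)
    for rater, subject in ratings:
        r_node, s_node = ("rater", rater), ("subject", subject)
        graph[r_node].add(s_node)
        graph[s_node].add(r_node)
    if not graph:
        return False
    # Breadth-first search from an arbitrary starting node.
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(graph)


if __name__ == "__main__":
    print(is_coupled([(1, "P1"), (1, "P2"), (2, "P2"), (2, "P3")]))  # chained
    print(is_coupled([(1, "P1"), (2, "P2")]))  # two isolated pairs
```

In the first call the two raters share subject P2, so a path exists between any two nodes; in the second call the raters have no subject in common and the data are not coupled.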

Application of the model in previous studies of clinical performance rating permitted
off-setting variations in rater stringency and calculation of adjusted ratings having improved
reliability and validity in each of 17 independent data sets obtained from two schools, with
different amounts of rater training, different rating inventories (one behaviorally anchored,
one not), and each inventory having different levels of trait specificity. C. Cason, Cason, and
Littlefield (1983) further demonstrated that the model was equally applicable to each of two
commonly cited (e.g., Dielman, Hull, & Davis, 1980) dimensions of clinical performance (i.e.,
cognitive-technical versus affective-interpersonal skills). The application of the model was
demonstrated to be equally useful on the ratings of paper proposals considered for
presentation at three meetings of an international scientific organization (Cason, Cason &
Stritter, 1986a; 1986b).

Simplified, Quantitative Rating Model

In our simplified model the expected subject rating (ESR), expressed as a percent of the
maximum possible rating, is a function of the difference, z, between the rater's stringency
(i.e., value associated with the rater reference point or RRP) and the subject's ability (i.e.,
value associated with the subject ability point or SAP). This relationship is modified by an
arbitrary scaling factor (SF = 100).

z = (SAP - RRP)/SF [1]

The theoretically postulated curvilinear (s-shaped) relationship between z and the expected
subject rating (ESR) has been arbitrarily stipulated as the unit-normal ogive. Thus, the ESR
(in percent) for a given z is equal to 100 times p(z), the area under the normal curve below z;
that is:

ESR = p(z) • 100 [2]

This is a deterministic, not a probabilistic, relationship. The model predicts a point value for
the expected rating, not a probability distribution.
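
Equations 1 and 2 can be evaluated directly with the unit-normal cumulative distribution function. The sketch below uses Python's standard library; the function and argument names are illustrative assumptions, while SF = 100 follows the text.

```python
from statistics import NormalDist

SF = 100.0  # arbitrary scaling factor from Equation 1
_UNIT_NORMAL = NormalDist()  # mean 0, s.d. 1


def expected_subject_rating(sap, rrp, sf=SF):
    """Expected subject rating (ESR) in percent of the maximum rating.

    sap: subject ability point; rrp: rater reference point (stringency).
    """
    z = (sap - rrp) / sf                 # Equation 1
    return 100.0 * _UNIT_NORMAL.cdf(z)   # Equation 2: area below z, times 100


if __name__ == "__main__":
    # A subject whose ability equals the rater's stringency is expected
    # to receive half of the maximum rating.
    print(expected_subject_rating(500.0, 500.0))
    # The same subject rated by a more stringent rater gets a lower ESR.
    print(expected_subject_rating(500.0, 600.0))
```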
In previous research the observed subject rating (OSR) was defined as equal to ESR plus
(random) error:

OSR = ESR + error [3]

Converting from percent to proportion (dividing by 100) and substituting the definition of
ESR from Equation 2 gives:

OSR = p(z) + error [4]

Because OSR is (now) a proportion, it must fall between 0 and 1. From Equation 4, it
follows that the sum of the error and area below z also must fall between 0 and 1. That is,
the sum of the expressions on the right side of Equation 4 may be treated as a proportion.
Without asserting anything different about the psychological location of the random error in
the rating process, the model may be expressed as:

OSR = p(z + error) [5]

Taking the inverse normal probability (ZIN, i.e., obtaining the z associated with a given
proportion) of both sides of Equation 5 gives:

ZIN(OSR) = z + error [6]

That is, Equation 6 shows that the inverse z-transform of the observed ratings is composed of
the difference (z) between subject ability (SAP) and rater stringency (RRP) plus random
error. Equation 6 permits the application of regression analysis to estimate these parameter
values rather than the less well known procedure used in earlier studies to solve Equation 3.
While Equation 3 was convenient for earlier studies, Equation 6 will provide equivalent (to
within a linear transformation) estimates of subject ability and rater stringency using a more
generally familiar method.
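
The ZIN transformation of Equation 6 corresponds to the inverse of the unit-normal cumulative distribution function. A minimal sketch follows. Note one assumption that is not taken from the paper: observed ratings of exactly 0% or 100% have no finite inverse-normal value, so the sketch clips ratings to a hypothetical margin (eps) before transforming them. How the original programs treated such ratings is not stated here.

```python
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()


def zin(rating_percent, eps=0.5):
    """Inverse-normal transform of an observed rating given in percent.

    eps is a hypothetical clipping margin (in percent) that keeps ratings
    of 0% and 100% finite after transformation.
    """
    clipped = min(max(rating_percent, eps), 100.0 - eps)
    return _UNIT_NORMAL.inv_cdf(clipped / 100.0)


if __name__ == "__main__":
    for observed in (10.0, 50.0, 90.0, 100.0):
        print(observed, zin(observed))
```

Under the model, each transformed value estimates (SAP - RRP)/SF plus random error, which is what makes ordinary least-squares regression applicable.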

Regression Based Estimates of Parameters 

Estimation of the model parameters (i.e., RRPs and SAPs) is accomplished in two
phases. First, the observed ratings are transformed to proportions then to z's (using an
inverse normal probability function). These z's are used as the criterion values (Y vector) in
a regression model of the general form of Equation 7. The z's in the criterion vector may be
thought of as distances on the underlying stringency-ability scale (but containing error as in
Equation 6) between RRPs and SAPs which are implied by the original observed ratings.

Y = cU + b_1 R^1 + b_2 R^2 + ... + b_n R^n +
    b_{n+1} S^{n+1} + b_{n+2} S^{n+2} + ... + b_{n+k} S^{n+k} + E [7]

where:

Y is the criterion vector;

U is a unit vector containing a 1 for each observation in Y;

R^i (i = 1 through n; n = number of raters) is a vector containing a 1 if the observation in Y
pertains to a rating given by rater i, zero otherwise;

S^j (j = n+1 through n+k; k = number of subjects) is a vector containing a 1 if the
observation in Y is associated with subject j-n, zero otherwise; and,

c and b_1 through b_{n+k} are the raw regression weights that minimize the sum of the squares
of the values in the error vector (E).
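
To make the layout of Equation 7 concrete, the sketch below builds one row of predictor values per observed rating: the unit vector entry, then n rater indicators, then k subject indicators. It is a hypothetical illustration written for this description, not a reproduction of any existing program, and the function name and argument layout are assumptions.

```python
def equation7_rows(observations, raters, subjects):
    """Return one predictor row [U, R^1..R^n, S^(n+1)..S^(n+k)] per rating.

    observations: list of (rater_id, subject_id) pairs, one per rating in Y.
    raters, subjects: ordered lists fixing the column order.
    """
    rater_col = {rater: i for i, rater in enumerate(raters)}
    subject_col = {subject: j for j, subject in enumerate(subjects)}
    n = len(raters)
    rows = []
    for rater, subject in observations:
        row = [0] * (1 + n + len(subjects))
        row[0] = 1                                 # unit vector U
        row[1 + rater_col[rater]] = 1              # R^i: rating given by rater i
        row[1 + n + subject_col[subject]] = 1      # S^j: rating of subject j-n
        rows.append(row)
    return rows


if __name__ == "__main__":
    obs = [("A", "P1"), ("A", "P2"), ("B", "P2")]
    for row in equation7_rows(obs, ["A", "B"], ["P1", "P2"]):
        print(row)
```

Because every row contains exactly one rater indicator and one subject indicator, the rater columns and the subject columns each sum to the unit vector, so the predictor set is linearly dependent by construction.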
A special purpose computer program, GENVEC (Cason, 1987) is used to generate the above
model from input (files from OTS-PR) which specifies a rater identification number and
subject identification number for each observed rating. Program LMS (Linear Model Solver;
Cason, 1986), based on Ward and Jennings' (1973) program MODEL, gives an appropriate
regression analysis of the model generated by GENVEC. (The regression program must
allow for redundant vectors in the model.) LMS yields least-squares, raw regression weights (b's
not beta's) for Equation 7. As shown in Equation 8, pairs of b's and the unit vector weight
provide an estimate of the inferred "error free" distance between a rater-subject pair on the
stringency and ability scale:

RXTOS(I) = BOFS(I) - BOFRX + CONST [8]

where:

RXTOS(I) is the distance from a rater (RX) to subject I;
BOFS(I) is the regression weight (b) of subject I;
BOFRX is the regression weight (b) of an arbitrarily chosen rater (RX); and,
CONST is the regression constant (i.e., unit vector weight).

The second phase of the process of estimating the parameter values is to convert the
regression weights into theoretical distances (using Equation 8) and then into locations (i.e.,
RRPs and SAPs) on the stringency and ability scale (using Equation 9). A rater is arbitrarily
chosen (e.g., the rater with most ratings and lowest ID number) to anchor the scale and that
rater's RRP is set equal to 500. Once this is done, the location of all subjects is determined
(with respect to the arbitrary anchor point) by the linear equation:

SLOC(I) = ANCVAL + RXTOS(I) [9] 

where: 

SLOC(I) is the location (SAP) of subject I; and, 

ANCVAL is the arbitrary value (e.g., 500) used to anchor the scale.

As soon as all subjects are located, an analogous set of equations is solved to obtain the
remaining rater locations (i.e., rater stringencies or RRPs). Program LOCATE (Cason,
1985) is used to solve Equations 8 and 9 and the analogous equations for raters to obtain
estimates of all model parameter values, i.e., RRPs and SAPs. LOCATE also calculates a
calibrated or adjusted rating for each subject. The adjusted rating is calculated (using
Equations 1 and 2) as the mean of the ratings expected from a subject's SAP and the RRPs
of all the raters. Thus, the adjusted or calibrated rating is what the subject would have
received (disregarding random measurement error) had all the raters rated the person's
performance or product.
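
The two estimation phases and the adjusted rating can be sketched end to end as follows. This is not a reproduction of GENVEC, LMS, or LOCATE. As a swapped-in technique, the sketch fits the rater and subject locations directly by alternating averages with one rater anchored at 500, which should converge to a least-squares fit of Equation 6 when the data are coupled. All function names, the clipping margin for extreme ratings, and the convergence settings are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import NormalDist, mean

SF = 100.0       # scaling factor (Equation 1)
ANCVAL = 500.0   # anchor value for the chosen rater's RRP
PHI = NormalDist()


def zin(percent, eps=0.5):
    """Inverse-normal transform; eps is a hypothetical clipping margin."""
    return PHI.inv_cdf(min(max(percent, eps), 100.0 - eps) / 100.0)


def fit_simplified_model(ratings, anchor, tol=1e-9, max_iter=10_000):
    """Estimate RRPs and SAPs from (rater, subject, percent rating) triples."""
    by_rater, by_subject = defaultdict(list), defaultdict(list)
    for rater, subject, percent in ratings:
        d = SF * zin(percent)  # observed distance SAP - RRP, with error
        by_rater[rater].append((subject, d))
        by_subject[subject].append((rater, d))
    rrp = {r: ANCVAL for r in by_rater}
    sap = {s: ANCVAL for s in by_subject}
    for _ in range(max_iter):
        change = 0.0
        for s, obs in by_subject.items():
            new = mean(rrp[r] + d for r, d in obs)
            change = max(change, abs(new - sap[s]))
            sap[s] = new
        for r, obs in by_rater.items():
            if r == anchor:
                continue  # the anchor rater stays at ANCVAL
            new = mean(sap[s] - d for s, d in obs)
            change = max(change, abs(new - rrp[r]))
            rrp[r] = new
        if change < tol:
            break
    return rrp, sap


def adjusted_rating(sap_value, rrp):
    """Mean expected rating (percent) over all raters, per Equations 1 and 2."""
    return mean(100.0 * PHI.cdf((sap_value - v) / SF) for v in rrp.values())


if __name__ == "__main__":
    data = [("A", "P1", 30.0), ("A", "P2", 70.0), ("B", "P2", 40.0), ("B", "P3", 80.0)]
    rrp, sap = fit_simplified_model(data, anchor="A")
    for subject, location in sap.items():
        print(subject, round(location, 1), round(adjusted_rating(location, rrp), 1))
```

In this made-up example rater B gives subject P2 a lower rating than rater A does, so B is located above A on the stringency scale, and the adjusted rating for each subject averages over both raters regardless of who actually rated it.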

Summary and Conclusions 

Our work over the last ten years in clinical performance evaluation yielded the theoretical
understanding of the rating process, the computer programs and the administrative support
unit that permitted practical application of this understanding to the specialized needs of the
abstract review process for this international research conference. This suggests that two of
our major goals for our performance rating system, i.e., flexibility and practicality, have been
substantially achieved. It may also suggest one set of reasons why the peer review process has
as yet been so little affected by the advent of computers. Computer programs which are
sufficiently appropriate to the needs of peer review processes have not been generally
available. In most peer review settings, the use of general purpose statistical or data base
programs and manual data entry to accomplish only compilation and reporting of observed
ratings would require impractically large expenditures of money or staff time. Even if
available, the specialized programs for finding calibrated ratings (from which the effects of
differences in rater standards have been removed) may not be practically applied unless an
economical mechanism is present for converting the abstract reviews into quantitative,
magnetic form. Most problems in clinical performance evaluation and peer review are not
responsive to quick, easy, simple or cheap solutions. But, improvements can be made
through development and use of appropriate data management and analysis systems. Even
though such systems are very expensive to develop, the costs can be justified if the resulting
systems have sufficient flexibility to be broadly applicable to peer review and/or performance
evaluation. As systems similar to OTS-PR become more commonly available, we may expect
a concomitant improvement in the reliability and validity of formal, technical peer review
processes.